Speech vocoding for laboratory phonology
Using phonological speech vocoding, we propose a platform for exploring the relations between phonology and speech processing and, more broadly, between the abstract and physical structures of a speech signal. Our goal is to take a step towards bridging phonology and speech processing and to contribute to the program of Laboratory Phonology. We show three application examples for laboratory phonology: compositional phonological speech modelling, a comparison of phonological systems, and an experimental
phonological parametric text-to-speech (TTS) system. The featural
representations of the following three phonological systems are considered in
this work: (i) Government Phonology (GP), (ii) the Sound Pattern of English
(SPE), and (iii) the extended SPE (eSPE). Comparing GP- and eSPE-based vocoded
speech, we conclude that the latter achieves slightly better results than the
former. However, GP - the most compact phonological speech representation -
performs comparably to the systems with a higher number of phonological
features. The parametric TTS, based on the phonological speech representation and trained on an unlabelled audiobook in an unsupervised manner, achieves 85% of the intelligibility of state-of-the-art parametric speech synthesis. We
envision that the presented approach paves the way for researchers in both
fields to form meaningful hypotheses that are explicitly testable using the
concepts developed and exemplified in this paper. On the one hand, laboratory
phonologists might test the applied concepts of their theoretical models, and
on the other hand, the speech processing community may utilize the concepts developed for the theoretical phonological models to improve current state-of-the-art applications.
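As a rough illustration of what such a featural representation looks like in practice, the Python sketch below encodes a phoneme sequence as binary feature vectors. The feature inventory and phoneme assignments are invented for illustration and are not the paper's actual GP/SPE/eSPE tables.

    # Minimal sketch of an SPE-style featural encoding; the feature set and
    # the phoneme-to-feature assignments below are hypothetical examples.
    import numpy as np

    FEATURES = ["voiced", "nasal", "continuant", "labial", "coronal", "high"]

    PHONEME_TABLE = {  # illustrative binary feature values per phoneme
        "p": [0, 0, 0, 1, 0, 0],
        "b": [1, 0, 0, 1, 0, 0],
        "m": [1, 1, 0, 1, 0, 0],
        "s": [0, 0, 1, 0, 1, 0],
        "i": [1, 0, 1, 0, 0, 1],
    }

    def encode(phonemes):
        """Map a phoneme sequence to a (phones x features) binary matrix."""
        return np.array([PHONEME_TABLE[p] for p in phonemes], dtype=float)

    print(encode(["b", "i", "s"]))  # one six-dimensional feature vector per phone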
Deep speech inpainting of time-frequency masks
Transient loud intrusions, often occurring in noisy environments, can completely overpower the speech signal and lead to an inevitable loss of
information. While existing algorithms for noise suppression can yield
impressive results, their efficacy remains limited for very low signal-to-noise
ratios or when parts of the signal are missing. To address these limitations,
here we propose an end-to-end framework for speech inpainting, the
context-based retrieval of missing or severely distorted parts of the time-frequency representation of speech. The framework is based on a
convolutional U-Net trained via deep feature losses, obtained using speechVGG,
a deep speech feature extractor pre-trained on an auxiliary word classification
task. Our evaluation results demonstrate that the proposed framework can
recover large portions of the missing or distorted time-frequency representation of speech, up to 400 ms in duration and 3.2 kHz in bandwidth. In particular, our approach
provided a substantial increase in the STOI and PESQ objective metrics for the initially corrupted speech samples. Notably, using deep feature losses to train the framework led to the best results compared to conventional approaches.
Comment: Accepted to Interspeech 2020
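As a minimal sketch of the deep feature loss idea, the snippet below compares intermediate activations of a frozen feature extractor on the inpainted and reference spectrograms; the small convolutional stack stands in for the pre-trained speechVGG, and the layer choice and L1 distance are assumptions rather than the paper's exact configuration.

    import torch
    import torch.nn as nn

    class DeepFeatureLoss(nn.Module):
        """Distance between activations of a frozen extractor (speechVGG-like)."""
        def __init__(self, extractor, layer_ids):
            super().__init__()
            self.extractor = extractor.eval()
            for p in self.extractor.parameters():
                p.requires_grad_(False)     # keep the extractor frozen
            self.layer_ids = set(layer_ids)

        def features(self, x):
            feats = []
            for i, layer in enumerate(self.extractor):
                x = layer(x)
                if i in self.layer_ids:
                    feats.append(x)
            return feats

        def forward(self, inpainted, target):
            pairs = zip(self.features(inpainted), self.features(target))
            return sum(torch.mean(torch.abs(a - b)) for a, b in pairs)

    # Toy stand-in for speechVGG: a small conv stack over (B, 1, F, T) spectrograms.
    extractor = nn.Sequential(
        nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    )
    criterion = DeepFeatureLoss(extractor, layer_ids=(1, 4))
    loss = criterion(torch.randn(2, 1, 128, 128), torch.randn(2, 1, 128, 128))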
Spiking neural networks trained with backpropagation for low power neuromorphic implementation of voice activity detection
Recent advances in Voice Activity Detection (VAD) are driven by artificial and recurrent neural networks (RNNs); however, using a VAD system in battery-operated devices requires further power efficiency. This can be
achieved by neuromorphic hardware, which enables Spiking Neural Networks (SNNs)
to perform inference at very low energy consumption. Spiking networks are
characterized by their ability to process information efficiently, in a sparse
cascade of binary events in time called spikes. However, a large performance gap separates artificial from spiking networks, mostly due to the lack of powerful SNN training algorithms. To overcome this problem, we exploit an SNN model that
can be recast into an RNN-like model and trained with known deep learning
techniques. We describe an SNN training procedure that achieves low spiking activity, together with pruning algorithms that remove 85% of the network connections with no performance loss. The model achieves state-of-the-art performance at a fraction of the power consumption of other methods.
Comment: 5 pages, 2 figures, 2 tables
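The sketch below illustrates the recasting idea in PyTorch: a leaky integrate-and-fire layer unrolled over time like an RNN, with a surrogate gradient substituted for the non-differentiable spike so standard backpropagation applies. The constants (decay, threshold, surrogate slope) are illustrative assumptions, not the paper's values.

    import torch

    class SpikeFn(torch.autograd.Function):
        @staticmethod
        def forward(ctx, v):
            ctx.save_for_backward(v)
            return (v > 0).float()                  # binary spike event

        @staticmethod
        def backward(ctx, grad_out):
            (v,) = ctx.saved_tensors
            surrogate = 1.0 / (1.0 + 10.0 * v.abs()) ** 2   # fast-sigmoid surrogate
            return grad_out * surrogate

    def lif_forward(inputs, w, beta=0.9, threshold=1.0):
        """Unroll a leaky integrate-and-fire layer; inputs: (T, B, n_in)."""
        T, B, _ = inputs.shape
        v = torch.zeros(B, w.shape[1])              # membrane potentials
        spikes = []
        for t in range(T):
            v = beta * v + inputs[t] @ w            # leaky integration
            s = SpikeFn.apply(v - threshold)        # spike above threshold
            v = v - s * threshold                   # soft reset after a spike
            spikes.append(s)
        return torch.stack(spikes)                  # sparse binary events in time

    w = torch.randn(40, 64, requires_grad=True)     # hypothetical layer sizes
    out = lif_forward(torch.randn(50, 8, 40), w)
    out.sum().backward()                            # gradients flow through the surrogate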
An Analysis of Rhythmic Staccato-Vocalization Based on Frequency Demodulation for Laughter Detection in Conversational Meetings
Human laughter can convey various kinds of meaning in human communication. There exist various kinds of laugh signals, for example vocalized and non-vocalized laughs. According to theories from psychology, among the vocalized laugh types, rhythmic staccato-vocalization significantly evokes positive responses in interactions. In this paper we exploit this observation to detect occurrences of human laughter in multiparty conversations from the AMI meeting corpus. First, we separate the high-energy frames from speech, discarding the low-energy frames, through power spectral density estimation. We then apply a rhythm-detection algorithm borrowed from music analysis to the high-energy frames. Finally, we detect rhythmic laugh frames by statistically analyzing the candidate rhythmic frames. This novel approach to detecting 'positive' rhythmic human laughter performs better than a standard laughter classification baseline.
Comment: 5 pages, 1 figure, conference paper
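A rough sketch of the first stage in Python, assuming frame-wise Welch power spectral density estimates and a percentile threshold to keep high-energy frames; frame sizes and the threshold are assumptions, not the paper's settings.

    import numpy as np
    from scipy.signal import welch

    def high_energy_frames(x, fs=16000, frame_len=400, hop=160, pct=75):
        """Keep frames whose PSD-estimated power is above a percentile."""
        frames, powers = [], []
        for start in range(0, len(x) - frame_len, hop):
            frame = x[start:start + frame_len]
            f, psd = welch(frame, fs=fs, nperseg=256)
            powers.append(np.sum(psd) * (f[1] - f[0]))  # approx. total power
            frames.append(frame)
        powers = np.array(powers)
        keep = powers >= np.percentile(powers, pct)     # drop low-energy frames
        return [fr for fr, k in zip(frames, keep) if k]

    x = np.random.randn(16000)               # 1 s of noise as a stand-in signal
    candidates = high_energy_frames(x)       # frames passed on to rhythm detection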
ALO-VC: Any-to-any Low-latency One-shot Voice Conversion
This paper presents ALO-VC, a non-parallel, low-latency, one-shot voice conversion method based on phonetic posteriorgrams (PPGs). ALO-VC enables any-to-any voice conversion using only one utterance from the target speaker, with only 47.5 ms of future look-ahead. The proposed hybrid signal processing and machine
learning pipeline combines a pre-trained speaker encoder, a pitch predictor to
predict the converted speech's prosody, and positional encoding to convey phoneme location information. We introduce two system versions: ALO-VC-R,
which uses a pre-trained d-vector speaker encoder, and ALO-VC-E, which improves
performance using the ECAPA-TDNN speaker encoder. The experimental results
demonstrate both ALO-VC-R and ALO-VC-E can achieve comparable performance to
non-causal baseline systems on the VCTK dataset and two out-of-domain datasets.
Furthermore, both proposed systems can be deployed on a single CPU core with 55 ms latency and a 0.78 real-time factor. Our demo is available online.
Comment: Accepted to Interspeech 2023. Some audio samples are available at https://bohan7.github.io/ALO-VC-demo
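A back-of-the-envelope sketch of how the reported numbers combine, assuming a hypothetical 10 ms frame hop (the paper reports only the totals):

    look_ahead_ms = 47.5          # future context required by the converter
    rtf = 0.78                    # processing time / audio duration on one CPU core
    hop_ms = 10.0                 # hypothetical frame hop

    # Per-frame latency is roughly the look-ahead plus the compute for one hop.
    total_ms = look_ahead_ms + rtf * hop_ms
    print(f"approx. end-to-end latency: {total_ms:.1f} ms")   # ~55 ms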
MOSRA: Joint Mean Opinion Score and Room Acoustics Speech Quality Assessment
The acoustic environment can degrade speech quality during communication
(e.g., video call, remote presentation, outside voice recording), and its
impact is often unknown. Objective metrics for speech quality have proven
challenging to develop given the multi-dimensionality of factors that affect
speech quality and the difficulty of collecting labeled data. Hypothesizing that the acoustic environment impacts speech quality, this paper presents MOSRA: a
non-intrusive multi-dimensional speech quality metric that can predict room
acoustics parameters (SNR, STI, T60, DRR, and C50) alongside the overall mean
opinion score (MOS) for speech quality. By explicitly optimizing the model to
learn these room acoustics parameters, we can extract more informative features
and improve the generalization for the MOS task when the training data is
limited. Furthermore, we also show that this joint training method enhances the
blind estimation of room acoustics, improving the performance of current
state-of-the-art models. An additional benefit of this joint prediction is improved explainability of the predictions, which is a valuable feature for many applications.
Comment: Submitted to Interspeech 2022
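A minimal sketch of the joint training idea: a shared encoder with one regression head per target, optimized with a summed loss. Architecture sizes and the equal task weighting are assumptions, not the MOSRA configuration.

    import torch
    import torch.nn as nn

    TARGETS = ["mos", "snr", "sti", "t60", "drr", "c50"]

    class JointQualityModel(nn.Module):
        def __init__(self, n_feats=64, hidden=128):
            super().__init__()
            self.encoder = nn.GRU(n_feats, hidden, batch_first=True)
            self.heads = nn.ModuleDict({t: nn.Linear(hidden, 1) for t in TARGETS})

        def forward(self, x):                    # x: (B, T, n_feats) spectral frames
            _, h = self.encoder(x)               # final hidden state summarizes the utterance
            return {t: head(h[-1]).squeeze(-1) for t, head in self.heads.items()}

    def joint_loss(preds, labels):
        # equal weighting across tasks (an assumption)
        return sum(nn.functional.mse_loss(preds[t], labels[t]) for t in TARGETS)

    model = JointQualityModel()
    x = torch.randn(4, 200, 64)
    labels = {t: torch.randn(4) for t in TARGETS}
    joint_loss(model(x), labels).backward()      # the shared encoder learns all tasks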
Baseline System for Automatic Speech Recognition with French GlobalPhone Database
This report presents one month of trainee work on the development of a French Automatic Speech Recognition (ASR) system using the French part of the multilingual GlobalPhone database, GlobalPhone_FR. The purpose of this report is to explain and give the results of training and testing the ASR system with this specific database. Two different methods are presented: Hidden Markov Models (HMMs) with MFCC/PLP features, and tandem features derived from Multilayer Perceptron (MLP) phone posteriors. The report presents the data preparation for GlobalPhone_FR ASR training and compares the two approaches. The word recognition accuracy achieved with MFCC features is 71.46%, and the tandem features with a 3-layer MLP improve the accuracy to 72.15%. We interpret this result as a baseline for the GlobalPhone_FR database.
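A small sketch of the tandem feature idea described above: frame-level phone posteriors from an MLP are log-compressed, decorrelated, and appended to the spectral features before HMM training. The data here are random stand-ins and all dimensions are assumptions.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.decomposition import PCA

    T, n_mfcc, n_phones = 1000, 13, 38           # hypothetical sizes
    mfcc = np.random.randn(T, n_mfcc)            # stand-in MFCC frames
    phone_labels = np.random.randint(0, n_phones, T)

    # One hidden layer gives an input-hidden-output (3-layer) MLP, as in the report.
    mlp = MLPClassifier(hidden_layer_sizes=(500,), max_iter=5)
    mlp.fit(mfcc, phone_labels)

    log_post = np.log(mlp.predict_proba(mfcc) + 1e-8)   # log-compressed posteriors
    decorrelated = PCA(n_components=20).fit_transform(log_post)

    tandem = np.hstack([mfcc, decorrelated])     # features fed to the HMM system
    print(tandem.shape)                          # (1000, 33)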